Abstract

Autoassociative networks were proposed in the 1980s as simplified models of memory function in the
brain, using recurrent connectivity with Hebbian plasticity to store patterns of neural activity that can
be later recalled. This type of computation has been suggested to take place in the CA3 region of the
hippocampus and at several levels in the cortex. One of the weaknesses of these models is their apparent 
inability to store correlated patterns of activity. We show, however, that a small and biologically plausible 
modification in the 'learning rule' (associating to each neuron a plasticity threshold that reflects its 
popularity) enables the network to handle correlations. We study the stability properties of the resulting 
memories (in terms of their resistance to the damage of neurons or synapses), finding a novel property of 
autoassociative networks: not all memories are equally robust, and the most informative are also the most 
sensitive to damage. We relate these results to category-specific effects in semantic memory patients, 
where concepts related to 'non-living things' are usually more resistant to brain damage than those related 
to 'living things', a phenomenon suspected to be rooted in the correlation between representations of 
concepts in the cortex. 
1 Introduction

Autoassociative memory networks can store patterns of neural activity by modifying the synaptic weights 
that interconnect neurons [Hopfield, 1982] [Amit, 1989], following the simple rule first stated by Donald
O. Hebb: neurons that fire together wire together [Hebb, 1949]. Once a pattern of activity is stored, it
becomes an attractor of the dynamics of the system. Evidence of attractor behavior has been reported in the 
rat hippocampus in vivo [Wills et al., 2005] . Such memory mechanisms have been proposed to be present 
throughout the cortex, where Hebbian plasticity plays a major role.

The theoretical and computational literature studying variations of the original Hopfield model [Hopfield, 1982]
is profuse. Advantages toward optimality or biological plausibility have been demonstrated by varying the 
learning rule, the neuron model, the architecture or connectivity scheme and the statistics of the input 
data. The resulting changes in the behavior of the network, however, are often quantitative rather than 
qualitative. Attractor networks are robust systems that depend only weakly on details. Any optimized 
attractor network, in fact, appears to be able to retrieve a total amount of information that is never more 
than a fraction of a bit per synaptic variable. This limit, consistent with insight obtained with the Gardner
approach [Gardner, 1988] but never fully proven, implies that the 'storage capacity' of any associative
memory network is constrained by the number of independently modifiable synapses it is endowed with. A 
suboptimal organization can easily underutilize such capacity, but no clever arrangement can do better than 
that. Crossing the capacity limit induces a 'phase transition' into total amnesia, destroying the attractor 
dynamics that would lead to memory states. 

Subtler memory deficits than an overall collapse have been reported in the neuropsychological literature, 
such as category specific effects in the semantic memory system. Patients with partial damage in the cortical 
networks sustaining semantic memory are found to lose preferentially some concepts rather than others 
(typically animals rather than tools, or living rather than non-living things). Initially, research on these effects
produced two major antagonistic accounts: the sensory-functional theory [Warrington and Shallice, 1984]
[Warrington and McCarthy, 1987] and the domain-specific theory [Caramazza and Shelton, 1998]. Roughly,
they hypothesize that different categories of concepts are localized within partially different (the former) or 
completely different (the latter) cortical networks. Damage to particular areas would then produce a deficit 
in the corresponding category of concepts. Attempts to validate some predictions of these theories have not
been successful, and an alternative view has emerged in the last few years that, although formulated in various 
ways, basically hypothesizes that the crucial factor to understand category specific effects is the correlation 
among items of semantic information, presumed to be stored in one extended and only weakly heterogeneous 
network [Devlin et al., 2002] [Tyler et al., 2000] [Sartori and Lombardi, 2004] [McRae et al., 1997]. According
to this view, random damage to the network would produce selective impairments not because one category 
is more localized within the damaged area than the other, but rather because differences in the structure 
of correlations make some categories more vulnerable to damage than others. So far, this explanation has been
formulated qualitatively rather than quantitatively. The object of the present study is to fill
this gap with a theory that produces systematic quantitative predictions applicable, in principle, to these
and other memory networks storing correlated information. We focus on mathematical models that allow us to
assess the hypothesis in its 'pure' form, without discussing further other accounts of category specific deficits, 
found in the literature, which may of course offer complementary elements to an integrated explanation of 
empirical results. 

Most models of attractor networks consider patterns that, for the sake of the analysis, are generated 
by a simple random process, uncorrelated with each other. Some exceptions appeared during the 1980s,
when interest grew around the storage of patterns derived from hierarchical trees [Parga and Virasoro, 1986]
[Gutfreund, 1988]. In particular, Virasoro [Virasoro, 1988] relates the behavior of networks of general architecture
to prosopagnosia, an impairment in which certain patients fail to identify individual stimuli (e.g., faces) but can
still categorize them. Interestingly, his model indicates that prosopagnosia is not prevalent in networks endowed
with Hebbian plasticity. Other developments have described perceptron-like or other local rules to store generally
correlated patterns [Gardner et al., 1989] [Diederich and Opper, 1987] [Srivastava and Edwards, 2004]
or patterns with specifically spatial correlation [Monasson, 1992]. More recently, Tsodyks and collaborators
[Blumenfeld et al., 2006] have studied a Hopfield memory in which a sequence of morphs between two uncorrelated
patterns is stored. In their work, the use of a saliency function favouring unexpected over expected
patterns, during learning, can result in the formation of a continuous one-dimensional attractor that spans 
the space between two original memories. Such fusion of basins of attraction is an interesting phenomenon
that we leave for a later extension of this work. In this report, we assume that the elements stored in
semantic memory are discrete by construction. 

In summary, we aim to show here how a modified version of the standard 'Hebbian' plasticity rule enables 
an autoassociative network to store and retrieve correlated memories, and how a side effect of the need to 
use this modified learning rule is the emergence of substantial variability in the resistance of individual 
memories to damage, which, as we discuss, could explain the prevailing trends of category specific memory 
impairments observed in patients. 

1.1 Attractor networks 

Attractor networks are thought to sustain memory at several levels in the cortex and hippocampus, by virtue
of recurrent connections endowed with Hebbian plasticity. Models consider input information to the system
to be organized into patterns of activity, which the network has to 'remember'. We represent these patterns
by means of the variables $\xi_i^\mu$, which stand for the activity of neuron $i$ in the network when pattern $\mu$ is being
fed as an input. The weight of each recurrent synapse is modified following the coactivation of the pre- and
post-synaptic neurons. In the simplest model, neurons that were strongly activated by the presentation of
pattern $\mu$ reinforce their mutual connections, as a result of which, if only a group of them is active at some
time in the future, the others also tend to be activated. In other words, the presentation of a 'cue' causes 
the retrieval of the whole memory, which is a stable firing state of the network, also called an attractor of 
its dynamics. 

While some studies model the learning process itself, in which patterns are presented as inputs and 
synapses modified, others assume that learning has already occurred, so that stable or ideal weights have 
been reached, and analyze the resulting performance of the network. The present work belongs to this second 
group. 

If several patterns are memorized in the same network, the modifications introduced by each of them may 
be added linearly to the weight of synapses. When the total number of stored patterns p is large enough, 
such that neurons and synapses are shared by many different patterns, any attempt to retrieve a memorized 
pattern could suffer from 'interference', understood as the summed effect of the other memorized patterns 
on the relevant synapses. Theoretical studies have shown that in a network storing random patterns, the
strength of this interference depends on the parameter $\alpha = p/C$, where $C$ is the mean number of afferent
connections to each neuron. If the memory load is negligible, $\alpha \simeq 0$, memories are retrieved
optimally, or, in other words, the original patterns of activity are themselves stable attractors of the system.
When $\alpha$ is not negligible but still smaller than a critical value $\alpha_c$ (the storage capacity of the network),
patterns can be retrieved but not optimally. If a partial cue of pattern $\mu$ is presented to the network, its
activity evolves to a stable attractor state presenting a high but not full overlap with the original pattern.
The interference is not destructive, but displaces the attractors slightly out of their original positions. As $\alpha$
increases, approaching $\alpha_c$, this effect is stronger: the overlap between the attractor and the original pattern
is progressively lower, and the capability to complete partial cues is diminished. In the limit $\alpha = \alpha_c$,
attractors are stable but the network does not evolve towards them; retrieval occurs only when the cue is
already the full attractor. Finally, when $\alpha > \alpha_c$ the attractors become unstable and the stored memories
are no longer retrievable.

1.2 The model 

We consider a network with $N$ neurons and $C < N$ afferent synaptic connections per neuron. The network
stores $p$ patterns, and the parameter $\alpha = p/C$ measures its memory load. As in classical analyses
[Amit, 1989], we take the 'thermodynamic' limit ($p \to \infty$, $C \to \infty$, $N \to \infty$, $\alpha$ constant, $C/N$ constant), in
which the equilibrium properties of the network depend on $\alpha$ rather than separately on $N$, $C$ and $p$.

The activity of neuron $i$ is described by the variable $\sigma_i$, with $i = 1 \ldots N$. Each of the $p$ patterns is a
particular state of activation of the network. The activity of neuron $i$ in pattern $\mu$ is described by $\xi_i^\mu$, with
$\mu = 1 \ldots p$. The perfect retrieval of pattern $\mu$ is thus characterized by $\sigma_i = \xi_i^\mu$ for all $i$. For the sake of
simplicity, we will assume binary patterns, where $\xi_i^\mu = 0$ if the neuron is silent and $\xi_i^\mu = 1$ if the neuron fires.
Consistently, the activity states of neurons will be limited by $0 < \sigma_i < 1$. Extensions of this work to e.g.
threshold-linear units [Treves, 1990] or to Potts units [Kropff and Treves, 2005] are left for further analyses,
though, as usual with attractor networks, there is no reason to expect large differences in the qualitative
behavior of the system.

We assume that a fraction $a$ of the neurons is activated in each pattern, $a = \sum_i \xi_i^\mu / N$ for $\mu = 1 \ldots p$.
This sparseness parameter is critical in determining the storage capacity of any associative memory network
[Treves and Rolls, 1991].

Each neuron receives $C$ synaptic inputs. To describe the architecture of connections we use a random
matrix with elements $c_{ij} = 1$ if a synaptic connection between post-synaptic neuron $i$ and pre-synaptic
neuron $j$ exists and $c_{ij} = 0$ otherwise, with $c_{ii} = 0$ for all $i$, a requirement for most attractor network models
to function. In addition, synapses have associated weights $J_{ij}$.

The influence of the network activity on a given neuron $i$ is represented by the field

$$h_i = \sum_{j=1}^{N} c_{ij} J_{ij} \sigma_j \tag{1}$$

which enters a sigmoidal activation function when updating the activity of the neuron

$$\sigma_i = \left\{ 1 + \exp \beta \left( U - h_i \right) \right\}^{-1} \tag{2}$$

where $\beta$ is an inverse temperature parameter and $U$ is a threshold parameter, which must be kept of order 1
(given the appropriate scaling of the weights that we will adopt) in order to have a storage capacity close to
optimal [Buhmann et al., 1989] [Tsodyks and Feigel'Man, 1988]. If $U \ll 1$ all the neurons tend to activate,
somewhat similarly to what happens during an epileptic seizure. If, at the other extreme, $U \gg 1$, all neurons
tend to be silent. In both extreme situations the effect of U on the network is much stronger than that of 
the attractors. When U is of order 1, on the contrary, the attractors dominate the dynamics of the network, 
keeping the total activity of the network near the sparseness a even for transient states, independently of 
small variations of U. 
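As a minimal sketch of Eqs. [1] and [2], the following Python fragment computes the field and applies the sigmoidal update for a tiny network. The three-neuron connectivity, the uniform weights and the values of $\beta$ and $U$ are hypothetical, chosen only for illustration and not taken from the paper's simulations.

```python
import numpy as np

def local_field(c, J, sigma):
    """Eq. (1): h_i = sum_j c_ij * J_ij * sigma_j."""
    return (c * J) @ sigma

def update(h, beta, U):
    """Eq. (2): sigma_i = 1 / (1 + exp(beta * (U - h_i)))."""
    return 1.0 / (1.0 + np.exp(beta * (U - h)))

# Hypothetical toy network: 3 neurons, all-to-all connections, no self-connections.
c = np.ones((3, 3)) - np.eye(3)
J = np.full((3, 3), 0.5)               # illustrative uniform weights
sigma = np.array([1.0, 1.0, 0.0])      # neurons 0 and 1 active, neuron 2 silent

h = local_field(c, J, sigma)           # fields: 0.5, 0.5, 1.0
new_sigma = update(h, beta=10.0, U=0.5)
```

Neurons whose field sits exactly at the threshold end up at activity 1/2, while the neuron receiving input from both active units is driven close to 1.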

The learning rule that defines the weights Jij in classical models reflects the Hebbian principle: every 
pattern in which both neurons i and j are active contributes positively to Jij. In addition, in order to 
optimize storage, the rule may include some prior information about pattern statistics. In a one-shot learning
paradigm, with uncorrelated patterns, the optimal rule uses the sparseness $a$ as a 'learning threshold'
[Tsodyks and Feigel'Man, 1988],

$$J_{ij} = \frac{1}{Ca} \sum_{\mu=1}^{p} \left( \xi_i^\mu - a \right) \left( \xi_j^\mu - a \right). \tag{3}$$

Note that this 'classical' rule includes implausible positive contributions when both pre- and post-synaptic 
neurons are silent, and neglects a baseline value for synaptic weights, necessary to keep excitatory weights
positive. Both are simplifications convenient for the mathematical analysis, which have been
discussed elsewhere (e.g., in [Treves and Rolls, 1991]), and they will be assumed in the present model as well,
though, as we will show, the first and more critical one will not be necessary once we introduce our modified 
rule. 

The above rule has been effectively used to store patterns drawn at random from the distribution with 
probability 

$$P(\xi_i^\mu) = a \, \delta(\xi_i^\mu - 1) + (1 - a) \, \delta(\xi_i^\mu) \tag{4}$$

independently for each unit $i$ and pattern $\mu$. In such conditions, the storage capacity of the network is
$\alpha_c \propto a^{-1}$. This result assumes the limit of low sparseness, $a \ll 1$, which is the interesting case to model
brain function, a limit that we will also take in the rest of this paper.

Patterns that are correlated, unlike what is implied by the probability distribution in Eq. [4], cannot,
however, be stored effectively in a network with weights given by Eq. [3]. For example, patterns intended to
model correlated semantic memory representations have long been considered 'impossible to store'
in an attractor network [McRae et al., 1997] [Cree et al., 1999] [Cree et al., 2006].
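The classical setting described above can be sketched in a few lines of Python: random patterns drawn as in Eq. [4], weights built with a covariance rule that uses the sparseness $a$ as learning threshold (normalized by $Ca$, which is an assumption of this sketch), and retrieval from a partial cue through Eqs. [1] and [2]. The network size, number of patterns, threshold and inverse temperature are hypothetical values picked to keep the memory load small.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, a = 400, 5, 0.1        # hypothetical sizes, giving a small load alpha = p/C
beta, U = 50.0, 0.35         # hypothetical inverse temperature and threshold

# Eq. (4): each unit is active with probability a, independently of the others.
xi = (rng.random((p, N)) < a).astype(float)

# Fully connected network without self-connections, so C = N - 1.
c = np.ones((N, N)) - np.eye(N)
C = N - 1

# Covariance rule with the sparseness a as learning threshold (assumed 1/(C a) scaling).
J = (xi - a).T @ (xi - a) / (C * a)

# Partial cue: pattern 0 with a random 20% of its active units silenced.
sigma = xi[0].copy()
active = np.flatnonzero(sigma)
sigma[rng.choice(active, size=len(active) // 5, replace=False)] = 0.0

for _ in range(5):
    h = (c * J) @ sigma                             # Eq. (1)
    sigma = 1.0 / (1.0 + np.exp(beta * (U - h)))    # Eq. (2)

hits = np.mean(sigma[xi[0] == 1] > 0.5)           # active units that were recovered
false_alarms = np.mean(sigma[xi[0] == 0] > 0.5)   # silent units wrongly activated
```

At this low load the cue is expected to be completed almost perfectly. The sketch says nothing about correlated patterns, for which the text reports that this rule fails.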

1.3 Network damage in the model 

Semantic impairments can result from damage of very diverse nature, such as herpes encephalitis, brain abscess,
anoxia, stroke, head injury and dementia of Alzheimer type, the last being characterized by progressive and
widespread damage. How can we represent damage in our model network in a general way?

The model literature on attractor networks shows that the stability of memories depends on the parameter
$\alpha = p/C$ as explained above, where $p$ can be considered in this case as fixed and equal to the number of
concepts stored in the semantic memory of a patient. The sparseness $a$ also plays an important role, since
the critical value of $\alpha$, or the storage capacity $\alpha_c$, varies inversely with $a$. In addition, we will show in this
work that the distribution of popularity across neurons (the popularity of neuron $i$ being the fraction of patterns
in which it is active) is a crucial determinant of the storage capacity when memories are correlated. However, it is
interesting to notice that both in the modelling literature and in this paper, the total number of neurons
in the network $N$ is not a determinant factor for the stability of memories, as long as it is large enough to
apply statistics.

In our model, random damage to a memory network might affect only $C$ (if the damage is focused on
synapses) or $N$ and $C$ in the same proportion (if the damage is focused on neurons), while the sparseness
a and the distribution of popularity (see below) should, to a first approximation, remain unchanged due to 
randomness. Since N does not determine the stability of memories, here we simply model network damage 
as a decrease in the number of connections per neuron, C. Interestingly, forgetting in an intact network could 
be thought of as the modification of an increasing number of synaptic weights to values that are uncorrelated 
with the learned ones, and modeled in a similar way. The selective damage of an arbitrary group of synapses 
or neurons, instead, cannot be modelled simply as a decrease in C, and could lead to different and interesting 
results that are, however, outside the scope of this paper. 
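The damage model described above, a random decrease in the number of connections per neuron, can be illustrated with a short Python sketch. The helper `damage_synapses`, the network size and the damaged fraction are hypothetical and serve only to show how the memory load $\alpha = p/C$ grows when $C$ shrinks at fixed $p$.

```python
import numpy as np

def damage_synapses(c, fraction, rng):
    """Return a copy of the 0/1 connectivity matrix c in which a random
    fraction of the existing connections has been removed."""
    damaged = c.copy()
    rows, cols = np.nonzero(c)
    n_remove = int(round(fraction * len(rows)))
    idx = rng.choice(len(rows), size=n_remove, replace=False)
    damaged[rows[idx], cols[idx]] = 0
    return damaged

rng = np.random.default_rng(1)
N, p = 200, 20                                   # hypothetical sizes
c = (rng.random((N, N)) < 0.5).astype(int)       # random dilute connectivity
np.fill_diagonal(c, 0)                           # no self-connections

C_before = c.sum(axis=1).mean()                  # mean connections per neuron
c_damaged = damage_synapses(c, 0.3, rng)
C_after = c_damaged.sum(axis=1).mean()

alpha_before = p / C_before
alpha_after = p / C_after                        # load increases after damage
```

Because synapses are removed at random, the sparseness and the popularity of each neuron are untouched; only $C$, and hence $\alpha$, changes.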

2 Results 

2.1 A rule for storing correlated distributions of patterns 

We consider a distribution of patterns in which Eq. [4] no longer applies, although, to simplify the analysis,
we still assume patterns to have a fixed mean activity, as quantified by the sparseness $a$ (the more general
case is treated in [Kropff, 2007], resulting in a more complicated analysis but no qualitative changes in the
conclusions). We propose a learning rule similar to the one in Eq. [3], with the variant that now learning
thresholds are specific to each neuron,

$$J_{ij} = \frac{1}{Ca} \sum_{\mu=1}^{p} \left( \xi_i^\mu - a_i^{\mathrm{post}} \right) \left( \xi_j^\mu - a_j^{\mathrm{pre}} \right). \tag{5}$$

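Let us use a signal-to-noise analysis to identify appropriate values for such thresholds. The field in Eq.
[1] can be split into a signal and a noise part by assuming, without loss of generality, that pattern 1 is being
retrieved ($\sigma_j$ similar to $\xi_j^1$ for all $j$):

$$h_i = \frac{1}{Ca} \left( \xi_i^1 - a_i^{\mathrm{post}} \right) \sum_{j=1}^{N} c_{ij} \left( \xi_j^1 - a_j^{\mathrm{pre}} \right) \sigma_j + \frac{1}{Ca} \sum_{\mu=2}^{p} \left( \xi_i^\mu - a_i^{\mathrm{post}} \right) \sum_{j=1}^{N} c_{ij} \left( \xi_j^\mu - a_j^{\mathrm{pre}} \right) \sigma_j \tag{6}$$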

where the first term in the RHS is the signal and the second term is the noise. As usual, the signal 
is a single macroscopic term that drives activity toward the desired attractor state, while a sum of many 
microscopic contributions comprises the noise. To analyze the latter we assume that $\xi_i^\mu$ and $\xi_j^\nu$ are statistically
independent variables, as long as $i \neq j$ (whereas we do not require $\xi_i^\mu$ and $\xi_i^\nu$ to be independent; on the
contrary, the aim is to handle their correlation). If this condition of independence among units, which is
central to our analysis, is fulfilled, the noise term can be viewed, to a first approximation, as generated by a
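Gaussian distribution with mean

$$\langle\langle \mathrm{noise} \rangle\rangle = \frac{p-1}{Ca} \, \langle\langle \xi_i^\mu - a_i^{\mathrm{post}} \rangle\rangle \sum_{j=1}^{N} c_{ij} \, \langle\langle \xi_j^\mu - a_j^{\mathrm{pre}} \rangle\rangle \, \sigma_j. \tag{7}$$
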
If this mean is different from zero, the noise scales up with $p$, which is the first cause of the performance
collapse mentioned above (the optimal one-shot learning rule for uncorrelated patterns has $a_k^{\mathrm{post}} = a_k^{\mathrm{pre}} = a$
for all $k$, which results in general in a mean noise different from 0). For $\langle\langle \mathrm{noise} \rangle\rangle$ in Eq. [7] to vanish, at
least to leading order in $p$, we must choose either $a_i^{\mathrm{post}} = \langle\langle \xi_i^\mu \rangle\rangle$ or $a_j^{\mathrm{pre}} = \langle\langle \xi_j^\mu \rangle\rangle$. We choose the latter,

$$a_j^{\mathrm{pre}} = a_j \equiv \frac{1}{p} \sum_{\mu=1}^{p} \xi_j^\mu \tag{8}$$

where we have introduced $0 < a_i < 1$, the popularity of neuron $i$, which measures how widely the activity
of this neuron is shared among the patterns in memory. Once this particular choice has been made, one sees from Eq.
[5] that the contribution of $a_i^{\mathrm{post}}$ to the field $h_i$ vanishes, and its exact value is irrelevant. We then choose
$a_i^{\mathrm{post}} = 0$.

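The next step is to analyze how the variance of the noise distribution scales up with $p$ and $C$. We have

$$\langle\langle \left( \mathrm{noise} - \langle\langle \mathrm{noise} \rangle\rangle \right)^2 \rangle\rangle = \frac{1}{C^2 a^2} \sum_{\mu,\nu=2}^{p} \xi_i^\mu \xi_i^\nu \sum_{j,k=1}^{N} c_{ij} c_{ik} \sigma_j \sigma_k \left( \xi_j^\mu - a_j \right) \left( \xi_k^\nu - a_k \right) \tag{9}$$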

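which can be divided into four contributions that scale differently with $p$ and $C$, depending on whether or
not $j$ and $k$ on one side and $\mu$ and $\nu$ on the other are equal:

$$\begin{aligned}
\langle\langle \left( \mathrm{noise} - \langle\langle \mathrm{noise} \rangle\rangle \right)^2 \rangle\rangle
={}& \frac{1}{C^2 a^2} \sum_{\mu=2}^{p} \xi_i^\mu \sum_{j=1}^{N} c_{ij} \sigma_j^2 \left( \xi_j^\mu - a_j \right)^2 \\
&+ \frac{1}{C^2 a^2} \sum_{\mu \neq \nu = 2}^{p} \xi_i^\mu \xi_i^\nu \sum_{j=1}^{N} c_{ij} \sigma_j^2 \left( \xi_j^\mu - a_j \right) \left( \xi_j^\nu - a_j \right) \\
&+ \frac{1}{C^2 a^2} \sum_{\mu=2}^{p} \xi_i^\mu \sum_{j \neq k = 1}^{N} c_{ij} c_{ik} \sigma_j \sigma_k \left( \xi_j^\mu - a_j \right) \left( \xi_k^\mu - a_k \right) \\
&+ \frac{1}{C^2 a^2} \sum_{\mu \neq \nu = 2}^{p} \xi_i^\mu \xi_i^\nu \sum_{j \neq k = 1}^{N} c_{ij} c_{ik} \sigma_j \sigma_k \left( \xi_j^\mu - a_j \right) \left( \xi_k^\nu - a_k \right).
\end{aligned} \tag{10}$$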

The first term in the RHS scales like $(p-1)/C \simeq \alpha$, the second like $(p-1)(p-2)/C$, the third
like $(p-1)$ and the fourth like $(p-1)(p-2)$. Remembering, however, our definition of popularity in
Eq. [8] and the statistical independence between neurons, one can see that the leading contributions to the
second to fourth terms vanish. The remaining dependency of the variance on $\alpha$ is similar to the one found
in classical models of autoassociative memory with independent or randomly correlated patterns, indicating
that the new rule

$$J_{ij} = \frac{1}{Ca} \sum_{\mu=1}^{p} \xi_i^\mu \left( \xi_j^\mu - a_j \right) \tag{11}$$

is a generalization of the Hopfield model appropriate to the storage of correlated patterns. 
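The modified rule can be sketched in Python as follows: the popularity of each neuron is computed from the stored patterns and used as the presynaptic learning threshold. The dataset here is hypothetical (each unit is given its own firing probability, drawn from a Beta distribution, so that popularity varies across units); it is not the hierarchical algorithm used in the paper. The sketch also assumes a fully connected network, $C = N - 1$.

```python
import numpy as np

def popularity(xi):
    """Popularity a_j = (1/p) * sum_mu xi_j^mu, for xi of shape (p, N)."""
    return xi.mean(axis=0)

def modified_weights(xi, C, a):
    """Modified rule: J_ij = (1/(C a)) * sum_mu xi_i^mu * (xi_j^mu - a_j)."""
    return xi.T @ (xi - popularity(xi)) / (C * a)

rng = np.random.default_rng(2)
p, N = 50, 300                                   # hypothetical sizes

# Hypothetical dataset: unit j fires with its own probability prob[j].
prob = rng.beta(0.5, 4.5, size=N)                # mean firing probability 0.1
xi = (rng.random((p, N)) < prob).astype(float)

a = xi.mean()                                    # empirical sparseness
a_j = popularity(xi)
J = modified_weights(xi, C=N - 1, a=a)

# With a_j as threshold, the presynaptic factor sums to zero over patterns
# for every unit, which is what removes the mean of the noise.
presyn_sum = (xi - a_j).sum(axis=0)
```

Note that the rule only uses the popularity of each unit, not any information about how the patterns were generated.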




[Figure 1: plot of $p_{max}$ against $N$, the size of the network, for $N$ between 500 and 3000.]

Figure 1: The critical value $p_{max}$, measured as the value of $p$ at which 70% of the patterns are retrieved successfully. We show $p_{max}$
as a function of $N$, with $C$ kept at a fixed proportion of $N$, for the four combinations of two learning rules and two types of dataset. Violet:
one-shot 'standard' learning rule of Eq. [3]. Pink: modified rule of Eq. [11]. Solid: trivial distribution of randomly correlated patterns
obtained from Eq. [4]. Dashed: non-trivially correlated patterns obtained using a hierarchical algorithm. In three cases the scaling of
$p_{max}$ with $C$ is linear, as in the classical result. Only in the case of one-shot learning of correlated patterns is there a storage collapse.

Figure [1] shows simulations of networks of different size and connectivity, employing either the classical or
our modified learning rule, to store either uncorrelated or correlated memories, as described in Methods. The
hierarchical algorithm described in [Kropff and Treves, 2007] allows us to construct datasets of different $p$
and $N$ values with approximately the same correlation statistics. The four curves result from the combination
of the two different learning rules, the standard rule in Eq. [3] and the one in Eq. [11], with two types of
pattern distribution, correlated or not. With the standard, one-shot learning rule, the number of uncorrelated
patterns constructed using Eq. [4] that can be stored and correctly retrieved, $p_{max}$, grows linearly with the
connectivity $C$. With non-trivial correlations among patterns, however, the storage capacity collapses: rather
than scaling linearly with $C$, $p_{max}$ even decreases toward 0 for very high values of $C$. This catastrophe is
reversed when the popularity $a_i$ replaces the sparseness $a$ as a learning threshold, bringing $p_{max}$ back to its
usual linear dependence on $C$. The linear dependence of course holds also when the more advanced rule is
applied to the original dataset of uncorrelated (i.e., randomly correlated) patterns. Finally, it is important to
note that the success in retrieving patterns stored with the rule of Eq. [11] does not depend on the algorithm
that we used to construct the patterns, but rather shows the generality of the rule, as we do not include in
it information about how patterns are constructed. We have tested the modified network with other sets of
patterns (such as the random patterns in the same Figure or those described in Methods: patterns resulting
from setting arbitrary popularity distributions across neurons as shown in Figure [3] or patterns taken from
the semantic feature norms of McRae and colleagues [Kropff, 2007] [McRae, 2005]), always reaching levels of
retrieval that are consistent with the predictions of the theory.

Having defined the optimal model for the storage of correlated memories, we analyze in the following
sections its storage properties and their consequences through mean field equations. We note that the average
of the popularity across neurons is $\sum_j a_j / N = a \ll 1$. In the interesting limit that we will consider, the
popularity is thus generally near 0, and only exceptionally close to 1.

2.2 Retrieval with no interference: $\alpha \simeq 0$

If a pattern is being retrieved in a network with very low memory load ($\alpha \simeq 0$), the interference due to the
storage of other patterns is negligible. The field in Eq. [1] is driven by a single term corresponding to the
contribution of the pattern that is being tested for retrieval (which we call pattern 1), or, in other words,
the signal term,

$$h_i = \frac{1}{Ca} \, \xi_i^1 \sum_{j=1}^{N} c_{ij} \left( \xi_j^1 - a_j \right) \sigma_j. \tag{12}$$

This can be re-expressed by defining the variables

$$m_i \equiv \frac{1}{Ca} \sum_{j=1}^{N} c_{ij} \left( \xi_j^1 - a_j \right) \sigma_j \tag{13}$$

and by noticing that, since $N$ and $C$ are large (in the thermodynamic limit both tend to infinity) and $c_{ij}$ is
a random connectivity matrix,

$$m_i \simeq \frac{1}{Na} \sum_{j=1}^{N} \left( \xi_j^1 - a_j \right) \sigma_j \equiv m, \tag{14}$$

that is, the average of $(\xi_j^1 - a_j)\sigma_j$ across neurons, normalized by $a$. The variable $m$ always refers to the pattern that is being
tested for retrieval, and it measures its overlap with the state of the network.
Inserting Eq. [14] into Eq. [12] we obtain

$$h_i \simeq \xi_i^1 m. \tag{15}$$

This expression can be inserted into Eq. [2] to obtain the updated value of $\sigma_j$ for all neurons $j = 1 \ldots N$. If
the state of the network is stable, $\sigma_j$ does not change with updating, so it can be reinserted into Eq. [14],
yielding a single equation that describes the stable attractor states of the system



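$$m = \frac{1}{Na} \sum_{j=1}^{N} \left( \xi_j^1 - a_j \right) \left[ 1 + \exp \beta \left( U - \xi_j^1 m \right) \right]^{-1}. \tag{16}$$

Splitting the sum into the $aN$ terms in which $\xi_j^1 = 1$ and the $(1 - a)N$ terms in which $\xi_j^1 = 0$, we can
rewrite it as

$$m = \left( 1 - a^1 \right) \left\{ \left[ 1 + \exp \beta (U - m) \right]^{-1} - \left[ 1 + \exp \beta U \right]^{-1} \right\} \tag{17}$$

where the new parameter $0 < a^\mu < 1$ can be thought of either as the average popularity of the neurons
active in pattern $\mu$ or as the average overlap between pattern $\mu$ and the other patterns:

$$a^\mu \equiv \frac{1}{Na} \sum_{j=1}^{N} \xi_j^\mu a_j = \frac{1}{p} \sum_{\nu=1}^{p} \frac{1}{Na} \sum_{j=1}^{N} \xi_j^\mu \xi_j^\nu. \tag{18}$$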

Note that for the interesting limit of very sparse activity, in most cases $a^\mu \ll 1$. From the definition of $m$ in
Eq. [14] it can be noted that $m = 1 - a^1 \simeq 1$ for perfect retrieval (i.e., $\{\sigma_j\} = \{\xi_j^1\}$) and $m = a - a^\sigma \simeq 0$ if
the activity $\sigma$ of the network has sparseness $a$ but is unrelated to $\xi^1$ (with $a^\sigma$ defined for the state $\sigma$ as in Eq. [18]), i.e., retrieval fails.

Eq. [17] always admits the solution $m = 0$, and it may have another stable solution depending on
two combinations of parameters: $\beta U$ and $\beta(1 - a^1)$. Whenever this non-zero solution exists, retrieval is
possible. In Figure [2] we show, as a function of the two parameters, the highest value of $m$ that solves
Eq. [17]. A first order phase transition is observed: given a fixed value of $\beta U$ there is a critical value of
$\beta(1 - a^1)$ below which the only solution to Eq. [17] is $m = 0$, i.e., no retrieval. In the 'zero-temperature'
($\beta \to \infty$) limit, the condition for the existence of a non-zero solution in Eq. [17] reduces to $m = (1 - a^1) > U$,
showing that at the critical point $a^1 = 1 - U$. Clearly, the choice $U = 0$ would permit the
retrieval of patterns with arbitrary values of $a^1$ (which is, by definition, not larger than 1), but as shown
in [Buhmann et al., 1989] [Tsodyks and Feigel'Man, 1988] and in the following sections, a threshold value of
order 1 is necessary to obtain an extensive storage capacity, close to optimal, when interference due to the
storage of other patterns is not negligible.
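The retrieval condition can be checked numerically with a simple fixed-point iteration of Eq. [17], started from the perfect-retrieval value $m = 1 - a^1$. This is a minimal sketch: the function names and the values of $\beta$, $U$ and $a^1$ are hypothetical, and plain iteration is just one convenient way of finding the largest stable solution.

```python
import math

def f(x, beta):
    """Sigmoid of Eq. (2) as a function of x = U - h."""
    return 1.0 / (1.0 + math.exp(beta * x))

def solve_m(a1, beta, U, n_iter=500):
    """Iterate Eq. (17) starting from the perfect-retrieval value m = 1 - a1."""
    m = 1.0 - a1
    for _ in range(n_iter):
        m = (1.0 - a1) * (f(U - m, beta) - f(U, beta))
    return m

# Memory made of unpopular neurons: 1 - a1 = 0.9 exceeds the threshold U.
m_low_popularity = solve_m(a1=0.1, beta=20.0, U=0.5)

# Memory made of popular neurons: 1 - a1 = 0.3 is below the threshold U.
m_high_popularity = solve_m(a1=0.7, beta=20.0, U=0.5)
```

In the first case the iteration stays close to $1 - a^1$, so retrieval is possible; in the second it decays to the $m = 0$ solution.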




Figure 2: Numerical solutions of Eq. [17], varying the two relevant parameters: $\beta(1 - a^1)$ on the x axis and $\beta U$ on the y axis. A first
order phase transition is observed in the value of $m$ that solves Eq. [17]. In the limit $\beta \to \infty$ the transition occurs along the identity
line $1 - a^1 = U$.

An intuitive explanation of Figure [2] would be the following. The learning rule in Eq. [11] implies that the 
network is less confident of any neuron $j$ with high popularity, since its positive contributions to outgoing
weights are proportional to $1 - a_j$. This implies that the more popular, on average, the ensemble of
neurons underlying a given memory is (as expressed by its $a^\mu$ value), the less able it is to sustain, through
neural activity, the corresponding attractor state. When the average activating signal is smaller than the 
threshold U, retrieval is no longer possible. 
2.3 Retrieval with interference: diluted networks

To treat the case of extensive storage, $p$ scaling up with $C$, we consider the so-called highly diluted approximation,
which is valid when either $C \ll N$ ('diluted', i.e. sparse connectivity proper, [Derrida et al., 1987]) or
$a \ll 1$ (very sparse activity, [Treves and Rolls, 1991]). There are two independent motivations to study such
a limit: on one side it approximates real cortical networks, with their sparse connectivity and sparse firing;
on the other, calculations are much simpler than for fully connected networks, enabling deeper analysis and
wider generalization. In addition, one obtains in this limit differential equations for the dynamical evolution
of all relevant variables, valid also outside of equilibrium [Derrida et al., 1987]. Such an approach is outside
the scope of this paper, and it is left for future studies. It is worth mentioning that some experimental work
on semantic memory [Sartori and Lombardi, 2004] [Sartori et al., 200^ is based on a dynamical view of the
networks involved in semantic processing, as it focuses on the type of input cues that can lead to successful 
retrieval. 

The highly diluted approximation takes into account in the field $h_i$ a signal term and a Gaussian noise,
while neglecting the effect of a second source of noise due to the propagation of neural activity around
closed loops of synaptic connections. These effects scale in general like $\alpha a C/N$ [Roudi and Treves, 2004]
[Kropff, 2007], and are therefore negligible when $C/N \to 0$, $a \ll 1$ or, as in the previous section, $\alpha \simeq 0$.

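In Eq. [10] we had already obtained an expression of the variance of the noise part of the field $h_i$ when
considering it to be purely Gaussian. After computing the average over $\mu$ in the surviving first term, we
obtain

$$\langle\langle \left( \mathrm{noise} - \langle\langle \mathrm{noise} \rangle\rangle \right)^2 \rangle\rangle = \alpha \, a_i \left[ \frac{1}{Ca^2} \sum_{j=1}^{N} c_{ij} \, a_j (1 - a_j) \, \sigma_j^2 \right]. \tag{19}$$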

The expression between square brackets depends on $i$ only through the connectivity matrix $c_{ij}$. As in Eq.
[14], we can take advantage of the fact that $c_{ij}$ is random and $C$ large, and replace the sum with an average
over all neurons. We can conclude that $\langle\langle (\mathrm{noise} - \langle\langle \mathrm{noise} \rangle\rangle)^2 \rangle\rangle = \alpha a_i q$, where we define

$$q \equiv \frac{1}{Na^2} \sum_{j=1}^{N} a_j (1 - a_j) \, \sigma_j^2. \tag{20}$$

The local field then becomes

$$h_i = \xi_i^1 m + \sqrt{\alpha a_i q} \; z_i \tag{21}$$

where $z_i$ may be assumed to be drawn from a normal distribution with mean 0 and variance 1, statistically
independent of all other variables^1. To describe attractors of the system, as previously, we insert the field
into Eq. [2] to obtain the stable value of $\sigma_j$, which can be re-inserted into the definition of $m$ in Eq. [14]:

$$m = \frac{1}{Na} \sum_{j=1}^{N} \left( \xi_j^1 - a_j \right) \left[ 1 + \exp \beta \left( U - \xi_j^1 m - \sqrt{\alpha a_j q} \; z_j \right) \right]^{-1}. \tag{22}$$

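Making use of the independence of $z_j$ with respect to $a_j$ and $\xi_j^1$, we can take its average. The highly diluted
version of Eq. [16] is then

$$m = \frac{1}{Na} \sum_{j=1}^{N} \left( \xi_j^1 - a_j \right) \int Dz \left[ 1 + \exp \beta \left( U - \xi_j^1 m - \sqrt{\alpha a_j q} \; z \right) \right]^{-1} \tag{23}$$

where the Gaussian differential is

$$Dz \equiv \frac{dz}{\sqrt{2\pi}} \exp \left( -\frac{z^2}{2} \right) \tag{24}$$

expressing the distribution of $z_j$.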

In the following, for simplicity, we will take the limit of zero temperature, $\beta \to \infty$. The equation for $m$ becomes

$$m = \frac{1}{N a} \sum_{j=1}^{N} (\xi_j^1 - a_j)\, \phi\left( \frac{\xi_j^1 m - U}{\sqrt{\alpha a_j q}} \right), \tag{25}$$

where

$$\phi(y) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{y}{\sqrt{2}} \right) \right] \tag{26}$$

is a sigmoidal function increasing monotonically from 0 to 1, with $\phi(0) = 1/2$. Since in Eq. [25] the terms are not linear in $a_j$, it is not straightforward to obtain the new version of Eq. [17]. To do so we must first introduce the distribution of popularity across neurons, given by the probability

$$F(x) = P(a_i = x), \tag{27}$$

and the distribution of popularity across neurons that are active in the pattern we are testing for retrieval,

$$f(x) = P(a_i = x \mid \xi_i^1 = 1). \tag{28}$$
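As a concrete illustration of these definitions, the following Python sketch computes the popularity of each neuron from a small set of binary patterns and builds discrete versions of $F(x)$ and $f(x)$, together with the zero-temperature function $\phi$ of Eq. [26]. The toy patterns and the helper names (`popularity`, `distribution`, `phi`) are illustrative assumptions and are not taken from the original simulations.

```python
import math
from collections import Counter


def phi(y):
    """Zero-temperature transfer function: 0.5 * (1 + erf(y / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def popularity(patterns):
    """a_i = fraction of the stored patterns in which neuron i is active."""
    p = len(patterns)
    n = len(patterns[0])
    return [sum(pat[i] for pat in patterns) / p for i in range(n)]


def distribution(values):
    """Discrete probability of each distinct popularity value."""
    counts = Counter(values)
    return {x: c / len(values) for x, c in sorted(counts.items())}


# Toy data set (hypothetical): p = 4 patterns over N = 6 neurons.
patterns = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
]
a = popularity(patterns)
F = distribution(a)  # popularity distribution over all neurons
# Popularity distribution restricted to the neurons active in the first pattern.
f = distribution([a[i] for i, xi in enumerate(patterns[0]) if xi == 1])
print(a, F, f, phi(0.0))
```

In this toy case the first pattern is built from the two most popular neurons, so its $f(x)$ is concentrated at higher popularity values than $F(x)$.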



^In the simplest signal-to-noise approach [Kropff and Treves, 2005], two 'worst-case' conditions must be met in order to have stable attractors: $h_i = m - \sqrt{\mathrm{variance}} > U$ for values of $i$ in which $\xi_i^1 = 1$, and $h_i = \sqrt{\mathrm{variance}} < U$ for $\xi_i^1 = 0$. This shows that the optimal value of $U$ is $m/2 \simeq (1 - a_f)/2$, which depends on global rather than local information. Interesting corrections in which the optimal value of $U$ depends on $a_i$ and is thus different for each neuron might come out of considering the non-diluted case, including an additional term in the local field $h_i$ as mentioned above.

The purpose of introducing these distributions is to convert a discrete set of popularities $\{a_j\}$ into a continuous distribution, where the popularity is represented by the variable $x$. Since $N$ is large, we can transform the sum in Eq. [25] into an integral over these distributions. As a result we obtain the equation

$$m = \int_0^1 dx\, f(x) \left\{ (1 - x)\, \phi\left( \frac{m - U}{\sqrt{\alpha x q}} \right) + x\, \phi\left( \frac{-U}{\sqrt{\alpha x q}} \right) \right\} - \frac{1}{a} \int_0^1 dx\, F(x)\, x\, \phi\left( \frac{-U}{\sqrt{\alpha x q}} \right), \tag{29}$$

which extends Eq. [17] to the case of non-negligible interference.

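Since this equation depends not only on $m$ but also on $q$, we need a second equation to close the system and uniquely describe the stable states of the network. From the definition of $q$ in Eq. [20] we can repeat the steps [22] to [25] and obtain, for stable states and in the limit of zero temperature,

$$q = \frac{1}{N a^2} \sum_{j=1}^{N} a_j (1 - a_j)\, \phi\left( \frac{\xi_j^1 m - U}{\sqrt{\alpha a_j q}} \right). \tag{30}$$

Introducing again the distributions of popularity (steps [25] to [29]) we can simplify this expression into

$$q = \frac{1}{a} \int_0^1 dx\, f(x)\, x(1 - x) \left\{ \phi\left( \frac{m - U}{\sqrt{\alpha x q}} \right) - \phi\left( \frac{-U}{\sqrt{\alpha x q}} \right) \right\} + \frac{1}{a^2} \int_0^1 dx\, F(x)\, x(1 - x)\, \phi\left( \frac{-U}{\sqrt{\alpha x q}} \right). \tag{31}$$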

Eqs. [29] and [31] describe the stable states of the network in this 'diluted' approximation. As in the noiseless case, a phase transition separates regions of parameter space where a solution with $m \simeq 1 - a_f$ exists from regions where the only solution is $m = q = 0$. The latter can now be reached by increasing $\alpha = p/C$, i.e. the memory load. In other words, the phase transition to no retrieval determines the storage capacity of the system. If $f(x) = F(x) = \delta(x - a)$, which is the case for uncorrelated patterns, the classical equations for highly diluted binary networks [Buhmann et al., 1989] [Tsodyks and Feigel'man, 1988] are re-obtained, and the critical value of the memory load scales like

$$\alpha_c \propto \frac{1}{a \ln(1/a)} \tag{32}$$

for the relevant sparse limit $a \ll 1$.

How does this classical result generalize to the case of correlated representations?

2.4 The storage capacity

Already at first glance, the system of Eqs. [29] and [31], which determines the storage capacity of a network with correlated patterns, reveals a new property of associative memories. In both equations, the second term in the RHS depends on $F(x)$ and is thus common to the retrieval of any pattern. However, the RHS of both equations depends also on $f(x)$, the distribution of popularity among neurons active in the pattern that is being retrieved. In the general case, this distribution is different for every pattern, so that the stability properties of the associated attractors will differ from pattern to pattern.

To understand this idea it is convenient to think about the storage capacity as $p/C_{min}$ (the minimum connectivity necessary to sustain retrieval) rather than as $p_{max}/C$ (the maximum number of patterns that can be stored). In this view, each of the $p$ memory states stored in a network has an associated value of $C_{min}$ that depends on its own statistical properties and on the statistical properties of the whole dataset. Any particular pattern can be retrieved only if the actual connectivity level $C$ is higher than the value of $C_{min}$ associated with it.

This view is of particular interest for analyzing category-specific deficits in semantic memory. We can think of $p$ as being relatively fixed, corresponding, in the model, roughly to all the concepts acquired by a healthy subject during an entire life. Mild, non-selective damage to the network might decrease the parameter $C$, which would selectively affect the memories with a high value of $C_{min}$, while sparing the others.
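To make the role of the connectivity concrete, the following Python sketch iterates the two self-consistent equations for $m$ and $q$ in the simplest, uncorrelated case $f(x) = F(x) = \delta(x - a)$, where Eqs. [29] and [31] reduce to $m = (1-a)(\phi_+ - \phi_-)$ and $q = (1-a)(\phi_+ - \phi_-) + \frac{1-a}{a}\phi_-$, with $\phi_+ = \phi[(m-U)/\sqrt{\alpha a q}]$ and $\phi_- = \phi[-U/\sqrt{\alpha a q}]$. The fixed-point iteration, the starting point and the function name `stable_overlap` are assumptions made for illustration; the values $a = 0.1$, $U = 0.35$ and $p = 50$ are borrowed from the caption of Figure 3.

```python
import math


def phi(y):
    """Zero-temperature transfer function of Eq. [26]."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def stable_overlap(alpha, a=0.1, U=0.35, steps=200):
    """Iterate the delta-distribution form of Eqs. [29] and [31].

    Starts from the retrieval state (m = 1 - a, q = 1 - a) and returns the
    pair (m, q) reached after `steps` iterations. Assumes U > 0.
    """
    m, q = 1.0 - a, 1.0 - a
    for _ in range(steps):
        s = math.sqrt(alpha * a * q)
        if s == 0.0:
            # No noise left: the field is purely deterministic.
            phi_plus = 1.0 if m > U else 0.0
            phi_minus = 0.0
        else:
            phi_plus = phi((m - U) / s)   # neurons active in the pattern
            phi_minus = phi(-U / s)       # neurons inactive in the pattern
        m = (1.0 - a) * (phi_plus - phi_minus)
        q = (1.0 - a) * (phi_plus - phi_minus) + (1.0 - a) / a * phi_minus
    return m, q


p = 50
for C in (500, 250, 100, 10):
    alpha = p / C  # memory load for p patterns and C connections per neuron
    m, _ = stable_overlap(alpha)
    print(f"C = {C:3d}  alpha = {alpha:4.1f}  m = {m:.3f}")
```

With $p$ fixed, lowering $C$ raises $\alpha$; in this sketch the overlap stays close to $1 - a$ at high connectivity and collapses toward 0 once $C$ is small enough, which is the uncorrelated counterpart of the $C_{min}$ picture described above.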

2.4.1 An entropy characterization of the noise 

To analyze Eqs. [29] and [31] we first consider that $\alpha$ and $U$ are small enough to ensure that the retrieval is successful, i.e. $\phi[(m - U)/\sqrt{\alpha x q}] \simeq 1$ and $\phi[-U/\sqrt{\alpha x q}] \simeq 0$. Following this, any pattern that we choose to test for retrieval has $m \simeq 1 - a_f$, as we had found for $\alpha \simeq 0$, and a value of the noise variable $q$ that is proportional to the average of $a_j(1 - a_j)$ over the neurons that are active in the pattern (as can be seen from Eqs. [30] or [31]), or in other words,

$$q \simeq \frac{S_f}{a} \equiv \frac{1}{a} \int_0^1 dx\, f(x)\, x(1 - x). \tag{33}$$

Similarly to Shannon's entropy, $S_f$, and in consequence the noise variable $q$, approaches 0 if neurons in the distribution are all either very popular or unpopular in their firing, while it is maximum ($S_f = 1/4$) when $f(x) = \delta(x - 1/2)$, i.e. all neurons have popularity $a_i = 1/2$^. Thus, a pattern will be better retrieved if a) it includes neurons that are as unpopular as possible (as shown previously, to ensure $m \simeq 1 - a_f > U$) and b) its neurons have a low 'entropy' value $S_f$, in order to minimize the noise $q \simeq S_f/a$.

An intuitive explanation of this comes from the analysis of the influence of neuron $j$ as noise in the field $h_i$, proportional to $\sum_{\mu} \xi_i^{\mu} (\xi_j^{\mu} - a_j)$, shown in Eq. [6]. If the popularity of neuron $j$ is very low, terms of this noise where $\xi_j^{\mu} = 1$ are large contributions (proportional to $1 - a_j$), but very infrequent, while terms in which $\xi_j^{\mu} = 0$ are very frequent but only proportional to $a_j \ll 1$. The exact opposite pattern emerges if neuron $j$ is very popular. As a result, in both cases the noise is very low. In the extreme of $a_j = 0$ or $a_j = 1$ the noise is exactly zero, since contributions of order 1 occur with probability 0, and vice versa. In such a case the dynamics of the network is guided purely by the signal terms, which take $h_i$ toward the correct value for retrieval. The noise is maximal when the probability of neuron $j$ being active is $a_j = 1/2$ and each term of the contribution of neuron $j$ to the noise in the field $h_i$ is proportional to $1 - a_j = 1/2$ or $a_j = 1/2$. Finally, since the noise is also proportional to $\sigma_j$ and pattern 1 is being retrieved, this effect is important only for the neurons $j$ that are active in this pattern, explaining fully Eq. [33].
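The quantity $S_f$ is simple to evaluate for a given pattern: it is the mean of $a_j(1 - a_j)$ over the neurons that are active in it. The Python sketch below does this for two hypothetical patterns; the popularity values and the function name `pattern_entropy` are made up for illustration.

```python
def pattern_entropy(pattern, popularity):
    """S_f: mean of a_j * (1 - a_j) over the neurons active in `pattern`."""
    active = [a_j for xi, a_j in zip(pattern, popularity) if xi == 1]
    return sum(a_j * (1.0 - a_j) for a_j in active) / len(active)


# Hypothetical popularities for N = 8 neurons.
popularity = [0.5, 0.5, 0.5, 0.05, 0.05, 0.95, 0.95, 0.0]
informative = [1, 1, 1, 0, 0, 0, 0, 0]    # neurons with a_j = 1/2
uninformative = [0, 0, 0, 1, 1, 1, 0, 0]  # very unpopular or very popular neurons

print(pattern_entropy(informative, popularity))
print(pattern_entropy(uninformative, popularity))
```

The first pattern reaches the maximum $S_f = 1/4$, while the second, built only from neurons with popularity close to 0 or 1, has a much lower value and would therefore contribute much less noise during retrieval.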

2.4.2 The storage capacity is inverse to Sf 

As $\alpha$ increases, the assumption $\phi[(m - U)/\sqrt{\alpha x q}] \simeq 1$ eventually becomes incorrect, and for some critical value $\alpha_c$ a retrieval solution with $m \simeq 1 - a_f$ no longer exists. A generally fair approximation when studying storage capacity is to assume that $\alpha_c$ scales inversely to the factor that accompanies $\alpha$ in the argument of $\phi$, which in this case is $xq$. However, since $x$ is a variable that spans the whole range from 0 to 1, the approximation is not useful in itself. In more general terms, $\alpha_c$ should scale inversely to $x_f q$, with $0 < x_f < 1$ some intermediate value with a strong dependence on $f(x)$. In this section we consider the case in which the variance of $F(x)$ is small enough to allow the approximation of $x$ by its average $a$ in the argument of $\phi$, while in Methods we analyze some more general examples.

Our first-order approximation, assuming $\alpha_c$ inverse to $aq$ and $q \simeq S_f/a$, leads to

$$\alpha_c \propto \frac{1}{S_f}. \tag{34}$$

^Technically, this function applied to a single unit is Tsallis' entropy with parameter $q = 2$. Note, however, that Tsallis' entropy is not additive for independent events, while our $S_f$ is clearly a normalized extensive quantity.



In line with what we had explained intuitively, the storage capacity, or $p/C_{min}$, is inverse to the entropy $S_f$ of the pattern. In the classical case of randomly correlated patterns $S_f = a(1 - a) \simeq a$ (again, assuming cortical activity to be sparse, the interesting approximation is always $a \ll 1$), which leads to the Tsodyks and Feigel'man result in Eq. [32] without the logarithmic correction.

This correction appears only when $\phi(-U/\sqrt{\alpha a q})$ starts to be significantly different from 0. The largest contribution is the one given by the second term in the RHS of Eq. [31], since it is not negligible when $\phi(-U/\sqrt{\alpha a q})$ is of order $a$ (considering $a \ll 1$), while the other neglected terms are only relevant when $\phi(-U/\sqrt{\alpha a q})$ is of order 1. Again, we use the approximation of low variance, so the term we are interested in becomes

$$T_2 \simeq \frac{1}{a^2}\, \phi\left( \frac{-U}{\sqrt{\alpha a q}} \right) \int_0^1 dx\, F(x)\, x(1 - x) = \frac{1}{a^2}\, \phi\left( \frac{-U}{\sqrt{\alpha a q}} \right) S_F, \tag{35}$$

where, similarly to $S_f$, we define $S_F$ as the entropy of the distribution $F(x)$. This term is near 0 for very small values of $\alpha$, where $q$ is dominated by the first term of Eq. [31], which can still be considered as $S_f/a$, and it becomes significant only when both terms are of comparable magnitude. If this happens at values of $\alpha$ that are smaller than the one indicated by Eq. [34], the correction introduced by this term is relevant. To estimate this correction we impose the first and second terms of Eq. [31] to be about equal ($T_2 \simeq S_f/a$) and consider $a \ll 1$, which leads to



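$$\phi\left( \frac{-U}{\sqrt{\alpha_c S_f}} \right) \simeq \frac{a S_f}{S_F}. \tag{36}$$

Inverting the function $\phi$ we obtain $\alpha_c$ as

$$\alpha_c \simeq \frac{U^2}{2 S_f} \left[ \mathrm{erf}^{-1}\left( 1 - \frac{2 a S_f}{S_F} \right) \right]^{-2}. \tag{37}$$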



The inverse error function can be approximated as

$$\mathrm{erf}^{-1}(1 - y) \simeq \sqrt{ \ln\left( \sqrt{\frac{2}{\pi}}\, \frac{1}{y} \right) } \tag{38}$$

for small values of $y$. Since $F(x)$ has low variance, $S_f, S_F \simeq a \ll 1$ and $a S_f/S_F$ can be taken to be a small quantity. We then approximate

$$\alpha_c \propto \frac{U^2}{2 S_f}\, \frac{1}{\ln\left( \dfrac{S_F}{\sqrt{2\pi}\, a S_f} \right)}. \tag{39}$$



If this scaling of $\alpha_c$ is lower than indicated by Eq. [34] (or, in other words, if $\ln(S_F/(a S_f)) > 1$) this correction is relevant. Finally, in the case of trivial correlations $f(x) = F(x) = \delta(x - a)$ and consequently $S_f = S_F \simeq a$. The full classical result of Eq. [32] is then reproduced by Eq. [39], indicating that the latter is a generalization of the former.

Figure 3: Simulations of the storage capacity of a network storing patterns with an arbitrary correlation distribution $F(x)$. The parameters are $N = 500$, $p = 50$, $a = 0.1$, $U = 0.35$ and variable $C$. For all values of $C$ each pattern is tested 10 times for stability, with different connectivity matrices $c_{ij}$. (a) Popularity distribution across the whole network, $F(x)$. Note that neurons with $a_i = 0$ do not really participate in network dynamics, making the effective values of $C$ and $N$ slightly lower. (b) Stable value of $m$ for each pattern vs. its $S_f$ value. The data have been smoothed by taking the median over a moving window. From blue toward violet: connectivity $C/N$ starting with 1 and decreasing in steps of 0.05. For each color, the graph shows that some patterns are retrieved while others are not, corresponding to low and high values of $S_f$. The critical value of $S_f$ at which the transition occurs moves to the left as the connectivity is reduced, which, as explained in the Introduction, is the strongest effect of random network damage. (c) Storage capacity computed from the step-like transitions in b. Black dots, left axis: critical value of $S_f$ vs. connectivity, showing the maximum retrievable $S_f$ supported by the $C$ connections of the network. Red line, right axis: percent of patterns with a value of $S_f$ lower than the critical one.

In Methods we find expressions similar to [39] for wider distributions of $F(x)$. As we show, the slower the decay of the tail of a smooth distribution $F(x)$ with increasing $x$, the poorer the performance in terms of storage capacity. If the decay of $F(x)$ is exponential or faster, the $1/S_f$ scaling of Eq. [39] holds with at most a larger logarithmic correction. If the decay is a power law, instead, the scaling is much poorer: $\alpha_c \propto a/S_f$, with, as usual, $a \ll 1$.

Figure 4: Distribution of $S_f$ in concepts belonging to the 'living' and the 'non living' categories obtained from the feature norms of McRae and colleagues [McRae et al., 2005]. Living things have a distribution centered at higher values of $S_f$, which in terms of our analysis means that they are more informative but also more susceptible to damage, as observed in patient studies.
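The step from Eq. [37] to Eq. [39] in Section 2.4.2 relies on replacing the inverse error function by its leading large-argument form, read here as $\mathrm{erf}^{-1}(1 - y) \simeq \sqrt{\ln(\sqrt{2/\pi}/y)}$. The following Python sketch compares this form with a numerical inversion of `math.erf`. The bisection-based `erfinv` is an illustrative helper written for this check (the standard library has no inverse error function), and the tested values of $y$ are arbitrary.

```python
import math


def erfinv(t, tol=1e-12):
    """Inverse error function on (0, 1), obtained by bisection on math.erf."""
    lo, hi = 0.0, 6.0  # erf(6) equals 1 to double precision
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.erf(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def erfinv_tail(y):
    """Leading-order approximation of erfinv(1 - y), intended for small y."""
    return math.sqrt(math.log(math.sqrt(2.0 / math.pi) / y))


for y in (1e-2, 1e-4, 1e-6):
    exact = erfinv(1.0 - y)
    approx = erfinv_tail(y)
    print(f"y = {y:.0e}  exact = {exact:.4f}  approx = {approx:.4f}")
```

For these values the approximation overestimates the exact result by no more than roughly 15 percent, and the discrepancy shrinks slowly as $y$ decreases, which is adequate for a scaling argument.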

2.4.3 Informative memories are less robust 

In Figure [3] we show results of simulations using a distribution of correlated patterns (see details in Methods), focusing on how the successful retrieval of a pattern depends on its $S_f$ value, and how a decrease in $C$ results in the selective loss of memories. This illustrates how the effective memory load of a network depends not only on the number of patterns that are being stored but also on how informative they are. An autoassociative memory could store a virtually unlimited number of patterns, for example, if they were constructed in such a way that all of the neurons contributed vanishing entropy, and hence were minimally informative: this would be the case if some neurons were active in nearly every pattern, while others in none, keeping the mean activity fixed to a value $a$. This result is in agreement with the notion that any associative memory network is ultimately constrained in the amount of information each of its synapses may store [Gardner, 1988].

The other interesting aspect of Eqs. [29] and [31] is that memory patterns are rather independent of one another in their retrievability. In the process of lowering $C$ (which is, as explained in the Introduction, the strongest effect of random network damage in our model) any pattern with a low value of $S_f$ would be retrieved even when most of the other patterns have become irretrievable. Generally speaking, informative memories are lost, while non-informative ones are kept.

This model thus offers a quantitative explanation of category-specific effects, along principles similar to those suggested, in a non-mechanistic way, by several previous studies [Tyler et al., 2000] [Sartori and Lombardi, 2004] [McRae et al., 1997]. In our network, the classical dichotomy would be verified if the semantic representations of living things had on average higher values of $S_f$ than those of nonliving things, a plausible assumption that can be assessed using evidence in the relevant literature. As an example, we analyze the feature norms of McRae and colleagues, experimentally obtained representations of 541 concepts in terms of 2526 features [McRae et al., 2005] (see Methods). In Figure [4] we show that the distributions of $S_f$ in the two categories overlap, but they are centered around different values of $S_f$, with living things on average more informative, hence more vulnerable to damage - a trend that is consistent with our analysis^.



3 Discussion 

Several experimental studies investigating semantic memory from the perspective of feature representation suggest that the representations of concepts in the human brain present non-trivial correlations [Vinson and Vigliocco, 2002] [Garrard et al., 2001], presumably reflecting to some extent non-trivial statistical properties of objects in the real world or in the way we perceive them. It has not yet been proposed, however, how a plausible memory network could reliably store such representations, while attempts to model the storage of feature norms (experimentally obtained prototypes mimicking concept representations) with attractor networks have had success only using small sets of memories [McRae et al., 1997] [Cree et al., 1999] [Cree et al., 2006]. We propose here a way in which a purely Hebbian autoassociative memory could store and retrieve sets of correlated representations of any size, using a number of connections per neuron $C$ that increases proportionally with $p$.

^One could feel tempted to store the patterns obtained from these norms in a network in order to simulate damage in a more direct way. Some new technical problems arise, however, since the sparseness $a$ is not constant across patterns. In addition, the performance of the network is very poor due to the fact that the popularity distribution of the norms $F(x)$ has a power-law decay. This poor performance does not contradict the theory developed here, but rather validates it, as elaborated in [Kropff, 2007].

Interestingly enough, our learning rule is not quite appropriate for a one-shot learning process, since it requires calculating statistical properties of the dataset - the popularity of neurons - before learning the patterns. In the case of semantic memory, concepts are acquired through long-term experience and through repeated exposure to diverse versions of the input, allowing, if necessary, for a continuous updating of popularity estimates. Episodic memory, on the other hand, requires one-shot learning, leaving no time for a learning rule like ours to deal with the correlation between memories. Associative networks may have evolved in other directions to enable the on-line storage of episodes and events. Evidence has recently been obtained [Leutgeb et al., 2007] supporting the suggestion that the dentate gyrus acts as an orthogonalizing device in the heart of the medial temporal lobe episodic memory system [Treves and Rolls, 1992]. The hippocampus could then function as an orthogonalized buffer that helps neocortical networks acquire correlated memories through an off-line process. It has been proposed [Marr, 1971] [Wilson and McNaughton, 1994] [Hinton et al., 1995] that it is during sleep that the hippocampus transfers to cortical areas the statistical biases of the input, in a process of consolidation. While one-shot learning of a large dataset of orthogonal or randomly correlated patterns can be achieved through the 'standard' rule of Eq. [3], the learning or stabilization of correlated memories in their final cortical destination may be consolidated by a learning process that reflects what in our model we have defined as the popularity of different neurons. Such consolidation may well accompany the spontaneous retrieval of representations stored in the hippocampus [Squire and Zola-Morgan, 1991] [McClelland et al., 1995].

Our results show that correlated representations can be stored at a cost: memories lose homogeneity, some remaining robust and others becoming weak in an inverse relation to the information they convey. These side effects should be observed in any associative memory system that is understood to store correlated patterns directly, and should be absent if information is first equalized through pattern orthogonalization.

Conversely, one may ask: are there benefits in representing correlated memories as they are, without recoding them into a more abstract, orthogonalized space? We have shown in a previous study [Kropff and Treves, 2007] that correlation plays a major role in driving a latching dynamics in a model of large cortical networks, in a process that could be a model of free association, and that might also underlie the capacity for language [Treves, 2005]. Also, semantic priming has been shown to be guided by correlation [Vigliocco et al., 2004] [Cree et al., 1999], selectively facilitating or inhibiting the retrieval of concepts, and potentially compensating for impaired episodic access [Ciaramelli et al., 2006]. On the other hand, embodied theories of cognition suggest that far from creating a neural structure of its own, the semantic system evolved on the same neural substrates that already had a primary function (visual, tactile or motor processing, etc.), for which correlation in the representation, even if useful, would be an inevitable outcome of their history.

Some predictions of our theory could perhaps be tested experimentally. The most immediate result to 
test is the relationship between the distribution of patterns and their relative robustness. The distribution 
of neural activity of different memory representations is however not available, for obvious technical reasons. 
Imaging techniques do not offer the required resolution, and collecting adequate statistics from single unit 
recordings in animals appears prohibitive. Nevertheless, other measurable quantities could yield an estimate 
of relevant statistical properties of the distribution: priming effects, for example, are related to the correlations between memory items. A second way to test the theory could be to assess the retrieval of a memory by a partial cue, similarly to what has been proposed in [Sartori and Lombardi, 2004], where the authors associate retrievability with a particular statistical measure: the semantic relevance of the cue. A third possibility could be to measure the speed of retrieval, which can be related to Eqs. [29] and [31] and, again, to the specific cue that the network receives to trigger recall. In this last case, however, retrieval activity in the semantic system should be isolated from other processes, such as categorization, which could take place automatically, affecting the overall timing. Probing systems other than semantic memory might also be a possibility, since our conclusions are general to any associative network with correlated memories. If a set of stimuli with controlled correlations were to be constructed (for example a set of pictures of caricature faces with exchangeable features), the memory of subjects trained with these stimuli could be tested for retrievability. The time-to-forget should then be related to the robustness, and inversely to the information content of each item, while with orthogonalized representations forgetting should be equalized.

4 Methods



4.1 Sets of patterns used in simulations 

In the simulations shown in Fig. [1] a hierarchical algorithm was used to generate the patterns. The main idea is to produce, in the first place, a generation of random 'parent' patterns which are not part of the dataset but are used to influence, with different strengths, a second generation (more details and a full analysis of the statistics of the resulting patterns can be found in [Kropff and Treves, 2007]). The reason for using this particular algorithm is that we needed a distribution of patterns with approximately the same correlation properties independently of $p$ and $N$. Following our studies in [Kropff and Treves, 2007], this is the case with the above algorithm, as long as $p$ and $N$ are not too small and asymptotic statistics applies.

For the simulations in Fig. [3] we needed higher levels of correlation than the ones that we could obtain with the algorithm described above, so as to illustrate the effects of large variability in the $S_f$ values of the patterns. On the other hand, we did not require in this case patterns with more than one value of $p$ and $N$. We then chose an algorithm that sets approximately an arbitrary popularity distribution over neurons. We chose

(40)

as the target distribution of popularity $F(x)$, with $\langle P(a_i) \rangle \simeq a$. Since the total number of patterns is $p$, we defined the function

$$n_k = N\, P(k/p), \tag{41}$$

expressing, when rounded to the closest integer, how many neurons should be active in $k$ patterns. For values of $n_k > 0.5$, we assigned a target popularity $P_i = k/p$ to $\mathrm{round}(n_k)$ arbitrary neurons. To construct each pattern $\mu$ we initially set all neurons in the pattern to be inactive. Then we picked neuron $i$ at random and set $\xi_i^{\mu} = 1$ with probability $P_i$, until $aN$ neurons had been set to be active for each pattern. Finite size effects caused the actual distribution of popularity, shown in Fig. [3]a, to be slightly different from the target one in Eq. [40], especially for low values of popularity. Since this region of the distribution is the least interesting one (see Section [4.3]), we did not modify the patterns further.

The feature norms analyzed in Fig. [4] were downloaded from the Psychonomic Society Archive of Norms, Stimuli, and Data web site, www.psychonomic.org/archive, with the consent of the authors. The norms list $p = 541$ concepts, relating several of $N = 2526$ features to each one of them. To each concept we associated a $\mu$ index and to each feature an $i$ index. We set $\xi_i^{\mu} = 1$ if feature $i$ was included in the description of pattern $\mu$ and $\xi_i^{\mu} = 0$ otherwise. Since not all patterns are associated with the same number of features, the sparseness is not constant across patterns. The average sparseness is $a \simeq 0.006$, equivalent to $\simeq 15$ features per concept. For each concept, $S_f$ is calculated as the average value of $a_i(1 - a_i)$ among the features that comprise it.
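The popularity-driven construction used for Fig. [3] (assign each neuron a target popularity $k/p$, then fill each pattern by random picks accepted with that probability) can be sketched in Python as follows. This is only a sketch under stated assumptions: the target distribution of Eq. [40] is not reproduced here, so an exponential weight is used as a stand-in; the counts $n_k$ are normalized so that they add up to about $N$; and the function names are hypothetical.

```python
import math
import random


def target_popularities(N, p, weight, rng):
    """Assign each neuron a target popularity k/p following `weight` (assumed shape)."""
    w = [weight(k / p) for k in range(p + 1)]
    total = sum(w)
    pops = []
    for k, w_k in enumerate(w):
        n_k = N * w_k / total  # neurons that should be active in k patterns
        if n_k > 0.5:
            pops.extend([k / p] * round(n_k))
    # Rounding may leave a few neurons over or short: trim or pad with popularity 0.
    pops = (pops + [0.0] * N)[:N]
    rng.shuffle(pops)
    return pops


def make_pattern(pops, n_active, rng):
    """Pick neurons at random, activating neuron i with probability pops[i]."""
    xi = [0] * len(pops)
    active = 0
    while active < n_active:
        i = rng.randrange(len(pops))
        if xi[i] == 0 and rng.random() < pops[i]:
            xi[i] = 1
            active += 1
    return xi


rng = random.Random(0)
N, p, a = 500, 50, 0.1
pops = target_popularities(N, p, lambda x: math.exp(-x / a), rng)
patterns = [make_pattern(pops, round(a * N), rng) for _ in range(p)]
print(sum(patterns[0]), max(pops))
```

Because every pattern is forced to contain exactly $aN$ active neurons, the popularity actually realized by this procedure differs somewhat from the target one, in line with the finite size effects mentioned in the text.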

4.2 Testing the stability of memories 

The stability of a memory item should be tested irrespective of how accurate a cue it needs in order to be retrieved. For this reason, we used the full original pattern as a cue, which is a good approximation of its attractor. The initial state, thus, is set to coincide with the tested pattern. In each update step, a neuron $i$ is chosen at random and updated using the rule in Eq. [2], keeping track of $m$, whose initial value is close to 1 by construction. Initially, $m$ varies rapidly, but it eventually converges to a stable value, either near 1 or near 0. Evidence of this is the step-like transition in the stable values of $m$, shown in Figure [3]b. The simulation stops when the variation of $m$ is smaller than a threshold, which we set small enough to give three-digit accuracy in $m$.
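A minimal Python sketch of such a stability test is given below. Several ingredients are assumptions made for the sake of a self-contained example, since the corresponding equations are not reproduced in this section: the weights are taken as $J_{ij} = \frac{1}{Ca} \sum_{\mu} \xi_i^{\mu} (\xi_j^{\mu} - a_j)$, the zero-temperature update as $\sigma_i = 1$ if $h_i > U$ and 0 otherwise, the network is fully connected ($C = N - 1$), and a fixed number of sweeps replaces the convergence criterion on $m$. The three toy patterns and the name `run_stability_check` are hypothetical.

```python
import random


def overlap(state, pattern, pops, a):
    """m = (1 / (N a)) * sum_j (xi_j - a_j) * sigma_j."""
    N = len(state)
    return sum((x - a_j) * s for x, a_j, s in zip(pattern, pops, state)) / (N * a)


def run_stability_check(patterns, cue, target, U, sweeps=20, seed=0):
    """Asynchronous zero-temperature updates starting from `cue`."""
    rng = random.Random(seed)
    p, N = len(patterns), len(cue)
    pops = [sum(pat[j] for pat in patterns) / p for j in range(N)]
    a = sum(pops) / N
    C = N - 1  # fully connected sketch without self-connections
    # Assumed learning rule with a popularity-dependent threshold a_j.
    J = [[0.0 if i == j else
          sum(pat[i] * (pat[j] - pops[j]) for pat in patterns) / (C * a)
          for j in range(N)] for i in range(N)]
    state = list(cue)
    for _ in range(sweeps * N):
        i = rng.randrange(N)
        h = sum(J_ij * s for J_ij, s in zip(J[i], state))
        state[i] = 1 if h > U else 0
    return state, overlap(state, target, pops, a)


# Three non-overlapping toy patterns with 6 active neurons each (N = 60).
N = 60
patterns = [[1 if 20 * mu <= j < 20 * mu + 6 else 0 for j in range(N)]
            for mu in range(3)]
final_state, m = run_stability_check(patterns, patterns[0], patterns[0], U=0.35)
print(final_state == patterns[0], round(m, 3))
```

In this toy case every active neuron has popularity 1/3, so the tested pattern is a fixed point of the dynamics and the overlap stays at $1 - 1/3$; a real test would use the correlated patterns and diluted connectivity described above.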

4.3 Storage capacity of more general distributions 

As we have shown in Results, the important quantity to estimate in order to find the scaling of the storage capacity of a memory network with correlated patterns is the second term in the RHS of Eq. [31],

$$T_2 = \frac{1}{a^2} \int_0^1 dx\, F(x)\, x(1 - x)\, \phi\left( \frac{-U}{\sqrt{\alpha x q}} \right). \tag{42}$$

The factor $\phi(-U/\sqrt{\alpha x q})$ is 0 when $x = 0$ and reaches its maximum when $x = 1$. On the other hand, since we consider the sparse limit $a \ll 1$, the distribution $F(x)$ is concentrated toward small values of $x$. For these two reasons, the interesting part of any smooth distribution function $F(x)$ is the decay of its tail with increasing $x$. We study in this section two interesting cases: exponential and power-law distributions. Keeping in mind that the exact behavior of $F(x)$ for small values of $x$ is less relevant, these results can be generalized to any distribution function with such tails.

4.3.1 Exponential distribution



The exponential distribution

$$F(x) = \frac{1}{a} \exp(-x/a) \tag{43}$$

is normalized to 1 and has mean equal to $a$, apart from a small correction of order $\exp(-1/a)$, which we neglect for simplicity. Its variance is about $a^2$, with a correction of the same order. Finally, $S_F \simeq a(1 - 2a)$. The critical second term in the RHS of Eq. [31] is

$$T_2 = \frac{1}{a^3} \int_0^1 dx\, x(1 - x)\, e^{-x/a} \int_{\sqrt{y/x}}^{\infty} Dz = \frac{1}{a^3} \int_{\sqrt{y}}^{\infty} Dz \int_{y/z^2}^{1} dx\, x(1 - x)\, e^{-x/a}, \tag{44}$$

where we have inverted the integration order. $Dz$ is the Gaussian differential defined in Eq. [24] and $y = U^2 a/(\alpha S_f)$. The inner integral in the right-most side of the equation confirms that the value of $F(x)$ for small $x$ is less relevant than its decay for large $x$. The RHS is now integrable. The resulting expression can be integrated a second time, but its analytical form is too complicated to include here. It is enough to mention that the largest contribution is proportional to $\exp(-\sqrt{2y/a})$. Assuming $2y/a \simeq 1$ modulo some logarithmic correction (that we consider inside the exponential and neglect elsewhere) this results in

$$T_2 \simeq \exp\left( -\sqrt{\frac{2y}{a}} \right) \frac{S_F}{a^2}. \tag{47}$$

Since only $y$ depends on $\alpha_c$ it is easy to see from this equation that indeed $2y/a \simeq 1$ modulo logarithmic corrections, making the previous assumption self-consistent. The storage capacity can be obtained by making the RHS of Eq. [47], as in the previous section, equal to $S_f/a$,

$$\alpha_c \propto \frac{2 U^2}{S_f}\, \frac{1}{\left[ \ln\left( S_F/(a S_f) \right) \right]^2}. \tag{48}$$

Note that the square on the logarithmic factor makes this storage capacity lower than the one found for $F(x)$ distributions of very low variance. Again, the correction is valid as long as the logarithm is large, in other words $\ln(S_F/(a S_f)) > 1$. If this condition is not met, the storage capacity scales like $1/S_f$.
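The decay of $T_2$ with $y$ for the exponential distribution can also be checked numerically. The Python sketch below evaluates $T_2 = \frac{1}{a^2} \int_0^1 dx\, F(x)\, x(1-x)\, \phi(-\sqrt{y/x})$ with $F(x) = \exp(-x/a)/a$ by a midpoint rule, using $q \simeq S_f/a$ so that $U/\sqrt{\alpha x q} = \sqrt{y/x}$, and prints $\ln T_2$ next to the leading exponent $-\sqrt{2y/a}$. The integration scheme, the grid size and the chosen values of $y$ are arbitrary choices made for illustration.

```python
import math


def phi(y):
    """Zero-temperature transfer function of Eq. [26]."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def t2_exponential(y, a, n=20000):
    """Midpoint-rule estimate of T2 for F(x) = exp(-x / a) / a on [0, 1]."""
    dx = 1.0 / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * dx
        total += math.exp(-x / a) / a * x * (1.0 - x) * phi(-math.sqrt(y / x))
    return total * dx / a ** 2


a = 0.1
for y in (0.5, 1.0, 2.0, 4.0):
    t2 = t2_exponential(y, a)
    print(f"y = {y:3.1f}  ln T2 = {math.log(t2):8.3f}  "
          f"-sqrt(2y/a) = {-math.sqrt(2.0 * y / a):8.3f}")
```

Only the trend is meaningful here: $T_2$ decreases monotonically with $y$, and the printed columns allow a comparison of its logarithm with the exponent quoted in the text, up to the slowly varying prefactor that the text treats as a logarithmic correction.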
4.3.2 Power law distribution



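We define the power-law distribution

$$F(x) = \begin{cases} 0 & \text{if } x < d \\ c\, x^{-\gamma} & \text{if } x > d \end{cases} \tag{49}$$

with $\gamma > 2$ and $d$ a small cutoff value that prevents the integral of $F(x)$ from diverging. The conditions for normalization and mean are

$$1 = c\, \frac{d^{1-\gamma} - 1}{\gamma - 1},$$

$$a = c\, \frac{d^{2-\gamma} - 1}{\gamma - 2}.$$

There is no simple analytical expression for $c$, $d$ or $S_F$ in terms of $a$ and $\gamma$.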
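Although no closed form is available, $c$ and $d$ are easy to obtain numerically. The Python sketch below eliminates $c$ through the normalization condition $\int_d^1 c\, x^{-\gamma} dx = 1$ and then finds $d$ by bisection on the mean condition $\int_d^1 c\, x^{1-\gamma} dx = a$. The bisection, its tolerances, the function name `power_law_constants` and the example values $a = 0.1$, $\gamma = 2.5$ are illustrative assumptions.

```python
def power_law_constants(a, gamma, tol=1e-12):
    """Solve the normalization and mean conditions for (c, d) by bisection.

    Assumes gamma > 2 and 0 < a < 1.
    """
    def c_given_d(d):
        # Normalization: c * (d**(1 - gamma) - 1) / (gamma - 1) = 1
        return (gamma - 1.0) / (d ** (1.0 - gamma) - 1.0)

    def mean_given_d(d):
        # Mean: c * (d**(2 - gamma) - 1) / (gamma - 2)
        return c_given_d(d) * (d ** (2.0 - gamma) - 1.0) / (gamma - 2.0)

    lo, hi = 1e-9, 1.0 - 1e-9  # the mean grows from about 0 to about 1 with d
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_given_d(mid) < a:
            lo = mid
        else:
            hi = mid
    d = 0.5 * (lo + hi)
    return c_given_d(d), d


c, d = power_law_constants(a=0.1, gamma=2.5)
print(c, d)
```

For small $d$ the two conditions give, to leading order, $d \approx a(\gamma - 2)/(\gamma - 1)$ and $c \approx (\gamma - 1)\, d^{\gamma - 1}$, so that $c$ scales like $a^{\gamma - 1}$; the numerical solution shows how rough this estimate is at a given sparseness.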
We want to compute

$$T_2 = \frac{1}{a^2} \int_d^1 dx\, c\, x^{-\gamma}\, x(1 - x)\, \phi\left( -\sqrt{\frac{y}{x}} \right),$$

where, again, $y = U^2 a/(\alpha S_f)$. $T_2$ is integrable; the result is expressed in terms of the incomplete gamma function $\Gamma[\cdot\,, \cdot]$ and can be simplified with the help of its series expansions. Inside the curly brackets of the resulting expression, $T_2$ is different from 0 only beyond the leading orders in $y$. At this order of approximation we neglect a similar term that includes the factor $\exp(-y/2d)$. As previously, the storage capacity can be estimated by making $T_2$ equal to $S_f/a$,

(56)

where we have used $c \propto a^{\gamma - 1}$. If the logarithm is of order 1 or smaller the storage capacity scales simply like $a/S_f$.